May 2, 2016

Homework 5

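  • Group roster
  • Final project topic
  • Data to be analyzed
    • Data source
    • Data format
  • Planned analysis questions
    • Hypotheses
    • Expected results
    • What problem the analysis results can solve
  • 5/2 (Mon) 11:59pm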
  • http://goo.gl/forms/8UgvNQlHVp

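Inspiration!?

Other topics?

  • Facebook fan page analysis: politicians' favorite phrases, and which words get the most likes
  • NBA team analysis: winning factors, player similarity
  • YouBike and weather, air pollution, a YouBike suitability score
  • Maps of divorces, marriages, births, and deaths
  • Lead pipes, real-estate actual price registration, flooding
  • Temperature and Google searches
  • Water quality and rainfall
  • Cancer mortality and factory locations
  • Movie dialogue: classification and ratings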

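Final project rules

  • 2-4 people per group -> 44/3 ≈ 15 groups
  • Two class sessions, 300 minutes -> 20 minutes per group
  • 15-minute presentation, 5 minutes of questions -> stick to the key points; strictly timed
  • One written report per group
    • Members and division of work (affects grades), data analysis report, discussion of the data and difficulties encountered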

Final project schedule

  • 5/30: 1 group
  • 6/6: 4 groups
  • 6/13: 5 groups
  • 6/20: 5 groups

Code used in class

Plotting System–Base

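Exploratory plot example: the familiar NBA data again…

#Load the SportsAnalytics package, installing it first if needed
if (!require('SportsAnalytics')){
    install.packages("SportsAnalytics")
    library(SportsAnalytics)
}
#Fetch player statistics for the 2014-2015 season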
NBA1415<-fetch_NBAPlayerStatistics("14-15")

Simple Summaries of Data - 1

One dimension

  • Five-number summary: summary(NBA1415$TotalPoints)
  • Boxplot: boxplot(NBA1415$TotalPoints)
  • Histogram: hist(NBA1415$TotalPoints)
  • Density plot: plot(density(NBA1415$TotalPoints))
  • Barplot: barplot(table(NBA1415$Team))
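The NBA data has to be fetched over the network, so here is a minimal sketch of the same one-dimensional summaries on R's built-in airquality data. The choice of the Ozone and Month columns is only for illustration; note that density() computes an estimate and needs plot() to draw it.

```r
# Built-in data: daily New York air quality measurements, May-Sep 1973
ozone <- airquality$Ozone            # contains missing values (NA)

# Five-number summary: min, lower hinge, median, upper hinge, max
five_num <- fivenum(ozone)           # NAs are dropped by default
print(summary(ozone))                # quartiles plus mean and NA count

boxplot(ozone, main = "Ozone (ppb)")
hist(ozone, main = "Ozone (ppb)")

# density() returns the estimate; plot() draws it
ozone_density <- density(ozone, na.rm = TRUE)
plot(ozone_density, main = "Ozone density")

# Bar plot of the number of observations per category
month_counts <- table(airquality$Month)
barplot(month_counts, xlab = "Month", ylab = "Days")
```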

Simple Summaries of Data - 2

Two dimensions

  • Multiple/overlayed 1-D plots (Lattice/ggplot2)
  • Scatterplots: plot(x, y)
  • Smooth scatterplots

\(> 2\) dimensions

  • Overlayed/multiple 2-D plots; coplots
  • Use color, size, shape to add dimensions
  • Spinning plots (pseudo-3D plots)
  • Actual 3-D plots (not that useful)
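As a rough illustration of adding dimensions with color and size, and of a coplot, the sketch below uses the built-in mtcars data. The column choices (cyl for color, hp for point size) are assumptions made for the example, not part of the lecture data.

```r
# Built-in mtcars data: fuel economy and design of 32 cars
cars <- transform(mtcars, cyl = factor(cyl))   # 4, 6 or 8 cylinders

# Third dimension as color, fourth as point size
plot(cars$wt, cars$mpg,
     col = cars$cyl,                 # factor codes 1..3 index the palette
     cex = cars$hp / 100,            # point size scaled by horsepower
     pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cars$cyl),
       col = seq_along(levels(cars$cyl)), pch = 19, title = "Cylinders")

# Coplot: one mpg ~ wt panel for each cylinder count
coplot(mpg ~ wt | cyl, data = cars)
```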

Multiple Boxplots

#value to plot (TotalPoints) ~ grouping variable (Team)
boxplot(TotalPoints ~ Team, data = NBA1415, col = "red")

Multiple Histograms

#mfrow sets how many subplots share one figure; mar sets the margin sizes
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) #2x1 subplots in one figure
hist(subset(NBA1415, Team == "SAN")$TotalPoints, col = "green")
hist(subset(NBA1415, Team == "GSW")$TotalPoints, col = "green")

Scatterplot

par(mfrow = c(1, 1)) #only one plot per figure
#Scatterplot with TotalMinutesPlayed on x and TotalPoints on y
plot(NBA1415$TotalMinutesPlayed, NBA1415$TotalPoints)
#Horizontal line at h = 500, width lwd = 2, style lty = 2 (dashed)
abline(h = 500, lwd = 2, lty = 2)

Scatterplot - Using Color

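Use color to add a third dimension of information to a two-dimensional scatterplot

#col=NBA1415$Team colors points by team, so players on different teams get different colors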
plot(NBA1415$TotalMinutesPlayed, NBA1415$TotalPoints,col=NBA1415$Team)
abline(h = 500, lwd = 2, lty = 2)

Multiple Scatterplots

#mfrow sets how many subplots share one figure; mar sets the margin sizes
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1)) #1x2 subplots in one figure
with(subset(NBA1415, Team == "SAN"), #rows of NBA1415 where the team is SAN
     plot(TotalMinutesPlayed, TotalPoints, main = "SAN")) #main = title
with(subset(NBA1415, Team == "GSW"), 
     plot(TotalMinutesPlayed, TotalPoints, main = "GSW"))

Summary

  • Exploratory plots are "quick and dirty"

  • Let you summarize the data (usually graphically) and highlight any broad features

  • Explore basic questions and hypotheses (and perhaps rule them out)

  • Suggest modeling strategies for the "next step"

Further resources

Plotting System -Lattice

The Lattice Plotting System

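It consists of the following packages:

  • lattice: the plotting functions, e.g. xyplot, bwplot, levelplot

  • grid: the foundation that the lattice package is built on

  • A plot is drawn by a single function call; marks and text cannot be added afterwards (unlike base plotting)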

Lattice Functions

  • xyplot: scatterplots
  • bwplot: box-and-whiskers plots (“boxplots”)
  • histogram: histograms
  • stripplot: like a boxplot, but drawn with the actual points
  • dotplot: dots on "violin strings"
  • splom: scatterplot matrix
  • levelplot, contourplot: for plotting "image" data

Lattice Functions

xyplot(y ~ x | f * g, data)
  • The first argument of a lattice function is usually a formula
  • y ~ x: formula notation, y-axis ~ x-axis
  • f, g are conditioning variables (optional)
    • *: an interaction between two variables
  • The second argument is the data
  • All other arguments have default values
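A minimal sketch of a formula with two conditioning variables joined by *, using the built-in mtcars data. The variable choices (cyl and am) are assumptions for illustration; each combination of their levels gets its own panel.

```r
library(lattice)

# Conditioning variables should be factors
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

# mpg on the y-axis, wt on the x-axis, one panel per cyl x am combination
p <- xyplot(mpg ~ wt | cyl * am, data = cars)
print(p)
```

With 3 cylinder levels and 2 transmission levels, the figure has 3 x 2 panels.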

Simple Lattice Plot

library(lattice)
library(datasets)
## Simple scatterplot
xyplot(Ozone ~ Wind, data = airquality) # y-axis ~ x-axis

Simple Lattice Plot

library(datasets)
library(lattice)
## Convert 'Month' to a factor variable
airquality <- transform(airquality, Month = factor(Month)) 
xyplot(Ozone ~ Wind | Month, #y-axis ~ x-axis | grouping variable
       data = airquality, layout = c(5, 1)) # 5 columns, 1 row

Lattice Behavior-1

p <- xyplot(Ozone ~ Wind, data = airquality)  ## Nothing happens!
print(p)  ## Plot appears

Lattice Behavior-2

xyplot(Ozone ~ Wind, data = airquality)  ## Auto-printing
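The practical consequence of this behavior shows up inside functions and loops, where auto-printing does not happen. The sketch below is one way to handle it; draw_ozone is a hypothetical helper name made up for this example.

```r
library(lattice)

# Hypothetical helper: builds the plot and draws it explicitly
draw_ozone <- function() {
  p <- xyplot(Ozone ~ Wind, data = airquality)
  print(p)        # explicit print; auto-printing only happens at top level
  invisible(p)    # return the trellis object without printing it again
}

p <- draw_ozone()
class(p)          # a lattice plot is an object of class "trellis"

# A stored plot cannot be drawn on afterwards, but update() returns
# a modified copy that can be printed again
p2 <- update(p, main = "Ozone vs. wind")
print(p2)
```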

Summary

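  • Lattice plots are drawn with a single function call (e.g. xyplot)
  • Spacing and margins are set by default
  • Well suited to comparing the same kind of plot across different conditions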

Plotting System -ggplot2

Plotting Systems in R: Base -1

  • “Artist’s palette” model
  • Start with blank canvas and build up from there
  • Start with plot function (or similar)
  • Use annotation functions to add/modify (text, lines, points, axis)
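A small sketch of the "artist's palette" workflow on the built-in airquality data: start with plot(), then add to the same canvas with annotation functions. The fitted line and the label text are illustrative choices only.

```r
# Start the canvas with a plot function
plot(airquality$Wind, airquality$Ozone,
     xlab = "Wind (mph)", ylab = "Ozone (ppb)",
     main = "Ozone and wind in New York")

# Add a fitted regression line on top of the points
fit <- lm(Ozone ~ Wind, data = airquality)   # rows with NA are dropped
abline(fit, col = "blue", lwd = 2)

# Label the day with the highest ozone reading
peak <- which.max(airquality$Ozone)
text(airquality$Wind[peak], airquality$Ozone[peak],
     labels = "highest ozone", pos = 4)

# Add a legend for the line
legend("topright", legend = "linear fit", col = "blue", lwd = 2)
```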

Plotting Systems in R: Base -2

  • Convenient, mirrors how we think of building plots and analyzing data
  • Can’t go back once plot has started (i.e. to adjust margins); need to plan in advance
  • Difficult to “translate” to others once a new plot has been created (no graphical “language”)
  • Plot is just a series of R commands

Plotting Systems in R: Lattice

  • Plots are created with a single function call (xyplot, bwplot, etc.)
  • Most useful for conditioning types of plots: Looking at how \(y\) changes with \(x\) across levels of \(z\)
  • Things like margins/spacing set automatically because entire plot is specified at once
  • Good for putting many many plots on a screen

Plotting Systems in R: Lattice-2

  • Sometimes awkward to specify an entire plot in a single function call
  • Annotation in plot is not intuitive
  • Use of panel functions and subscripts difficult to wield and requires intense preparation
  • Cannot “add” to the plot once it’s created

Plotting Systems in R: ggplot2

  • Split the difference between base and lattice
  • Automatically deals with spacings, text, titles but also allows you to annotate by “adding” +
  • Superficial similarity to lattice but generally easier/more intuitive to use
  • Default mode makes many choices for you (but you can customize!)

What is ggplot2?

  • An implementation of The Grammar of Graphics by Leland Wilkinson
  • Written by Hadley Wickham (while he was a graduate student at Iowa State)
  • A “third” graphics system for R (along with base and lattice)
  • Web site

What is ggplot2?

  • Grammar of graphics represents an abstraction of graphics ideas/objects
  • Think “verb”, “noun”, “adjective” for graphics
  • Allows for a “theory” of graphics on which to build new graphics and graphics objects

Grammar of Graphics

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”

  • from ggplot2 book

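Grammar of Graphics, in plain terms

Basic elements:

  • Aesthetic attributes: color, shape, point size, line width, and so on
  • Geometric objects: points, lines, boxplots, bars, and so on

Other elements:

  • Facets: draw multiple subplots in one figure; just tell the faceting function which variable to split by.
  • Stats: statistical transformations of the data
  • Scales: adjust point and line colors and shapes, the x/y axis ranges, and so on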

The Basics: qplot()

  • Works much like the plot function in base graphics system
  • Looks for data in a data frame, similar to lattice
  • Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)

The Basics: qplot()

  • Factors are important for indicating subsets of the data; they should be labeled
  • The qplot() hides what goes on underneath, which is okay for most operations
  • ggplot() is the core function and very flexible for doing things qplot() cannot do

ggplot2 “Hello, world!”

library(ggplot2) #load the ggplot2 package; install it first if it is not installed
#qplot(x, y, data = data frame) ---> draws a scatterplot
qplot(FieldGoalsAttempted, TotalPoints, data = NBA1415)

Modifying aesthetics

#color = Position colors the points by player position
qplot(FieldGoalsAttempted, TotalPoints, data = NBA1415,color=Position)

Adding a geom

#geom = c("point", "smooth") draws the points plus a smoothed trend line
qplot(FieldGoalsAttempted, TotalPoints, data = NBA1415,
      geom = c("point", "smooth"))

Histograms

#Histogram of TotalPoints; fill = Position colors the bars by player position
qplot(TotalPoints, data = NBA1415, fill = Position)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Facets

#qplot(x, y, data = data frame) ---> draws a scatterplot
#facets = . ~ Position: one panel per Position, arranged horizontally
qplot(FieldGoalsAttempted, TotalPoints, data = NBA1415,
      facets = . ~ Position)

Facets

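#facets = Position ~ .: one panel per Position, stacked vertically
qplot(FieldGoalsAttempted, TotalPoints, data = NBA1415,
      facets = Position ~ .)

#Same idea on ggplot2's built-in mpg data: highway mileage histograms, one row per drive type
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)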

Facets

#facets = Position ~ .: one panel per Position, stacked vertically
#binwidth = 100: each bin spans 100 points
qplot(TotalPoints, data = NBA1415,
      facets = Position ~ ., binwidth = 100)

Summary of qplot()

  • The qplot() function is the analog to plot() but with many built-in features
  • Syntax somewhere in between base/lattice
  • Produces very nice graphics, essentially publication ready (if you like the design)
  • Difficult to go against the grain/customize (don’t bother; use full ggplot2 power in that case)

Resources

What is ggplot2?

  • An implementation of the Grammar of Graphics by Leland Wilkinson
  • Grammar of graphics represents an abstraction of graphics ideas/objects
  • Think “verb”, “noun”, “adjective” for graphics
  • Allows for a “theory” of graphics on which to build new graphics and graphics objects

Basic Components of a ggplot2 Plot

  • A data frame
  • aesthetic mappings: how data are mapped to color, size
  • geoms: geometric objects like points, lines, shapes.
  • facets: for conditional plots.
  • stats: statistical transformations like binning, quantiles, smoothing.
  • scales: what scale an aesthetic map uses (example: male = red, female = blue).
  • coordinate system
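To make the component list concrete, here is a minimal sketch that names each piece in one plot. It uses the mpg data frame that ships with ggplot2; the colors and axis range are arbitrary illustrative choices.

```r
library(ggplot2)

p <- ggplot(mpg,                                   # a data frame
            aes(x = displ, y = hwy, color = drv)) +  # aesthetic mappings
  geom_point() +                                   # geom: points
  stat_smooth(method = "lm", se = FALSE) +         # stat: fitted line per group
  facet_grid(. ~ year) +                           # facets: one panel per year
  scale_color_manual(values = c("4" = "red",       # scale: drive type -> color
                                "f" = "blue",
                                "r" = "darkgreen")) +
  coord_cartesian(ylim = c(10, 45))                # coordinate system
print(p)
```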

Building Plots with ggplot2

  • When building plots in ggplot2 (rather than using qplot) the “artist’s palette” model may be the closest analogy
  • Plots are built up in layers
  • Plot the data
  • Overlay a summary
  • Metadata and annotation

Annotation

  • Labels: xlab(), ylab(), labs(), ggtitle()
  • Each of the “geom” functions has options to modify
  • For things that only make sense globally, use theme()
  • Example: theme(legend.position = "none")
  • Two standard appearance themes are included
  • theme_gray(): The default theme (gray background)
  • theme_bw(): More stark/plain
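A short sketch of these annotation tools together, again on ggplot2's built-in mpg data. The title and axis labels are made up for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 2, alpha = 0.6) +        # geom-level options
  labs(title = "Highway mileage by engine size",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon") +     # labels
  theme_bw() +                               # plainer built-in theme
  theme(legend.position = "none")            # global setting: hide the legend
print(p)
```

Putting theme(legend.position = "none") after theme_bw() matters, because a complete theme such as theme_bw() replaces earlier theme settings.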

Other themes

Point plot in ggplot2: geom_point()

Remember to load the ggplot2 package

#aes: aesthetic attributes (color, shape, point size, line width)
#geom_*: geometric objects (points, lines, boxplots, bars)
ggplot(NBA1415, aes(x = Position, y = TotalPoints)) +geom_point()

Box plot in ggplot2 geom_boxplot()

Remember to load the ggplot2 package

#aes: aesthetic attributes (color, shape, point size, line width)
#geom_*: geometric objects (points, lines, boxplots, bars)
ggplot(NBA1415, aes(x = Position, y = TotalPoints)) +geom_boxplot()

Faceting in ggplot2 facet_grid()

#facet_grid: adds subplots; Position~. stacks them vertically, .~Position lays them out horizontally
ggplot(NBA1415, aes(x = FieldGoalsAttempted, y = TotalPoints)) +
    geom_point()+facet_grid(Position~.)

Trend line in ggplot2: geom_smooth()

#geom_smooth: adds a trend line; method='lm' fits a linear regression
ggplot(NBA1415, aes(x = FieldGoalsAttempted, y = TotalPoints)) +
    geom_point()+facet_grid(Position~.)+geom_smooth(method='lm')

Color group in ggplot2 color=?

#color=Position: use Position as the coloring variable
ggplot(NBA1415, aes(x = FieldGoalsAttempted, y = TotalPoints,color=Position)) +
    geom_point()+geom_smooth(method='lm')

A Note about Axis Limits

testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50,2] <- 100  ## Outlier!
plot(testdat$x, testdat$y, type = "l", ylim = c(-3,3))

A Note about Axis Limits

g <- ggplot(testdat, aes(x = x, y = y))
g + geom_line()

Axis Limits

#ylim() sets the scale limits: values outside them become NA before drawing
g + geom_line() + ylim(-3, 3)

Axis Limits

#coord_cartesian() only zooms the view; all data, including the outlier, are kept
g + geom_line() + coord_cartesian(ylim = c(-3, 3))
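The difference between the two approaches can be checked without looking at the pictures. The sketch below inspects the data each plot would actually draw, using layer_data(). It swaps rnorm(100) for sin(1:100) so the result is reproducible; that substitution is an assumption of this example, not part of the original slides.

```r
library(ggplot2)

# Deterministic stand-in for the random test data, with one outlier
testdat <- data.frame(x = 1:100, y = sin(1:100))
testdat[50, 2] <- 100

base_plot <- ggplot(testdat, aes(x = x, y = y)) + geom_line()

# Data as seen by the geom after scales have been applied
drawn_ylim <- layer_data(base_plot + ylim(-3, 3))
drawn_zoom <- layer_data(base_plot + coord_cartesian(ylim = c(-3, 3)))

# Count how many times the outlier value survives in each version
kept_ylim <- sum(drawn_ylim$y == 100, na.rm = TRUE)
kept_zoom <- sum(drawn_zoom$y == 100, na.rm = TRUE)
print(c(ylim = kept_ylim, coord_cartesian = kept_zoom))
```

With ylim() the outlier is no longer in the drawn data, while coord_cartesian() keeps it and only restricts the visible window.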

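ggplot2 references

ggplot2 combined with maps

Choropleth map

  • Choropleth (shaded-area) map
  • Draws statistics as colors on the corresponding map regions
  • choroplethr package
  • A tool built on the ggplot2 package specifically for drawing choropleth maps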
if (!require('choroplethr')){
    install.packages("choroplethr")
    library(choroplethr)
}

US population by state

Uses the choroplethr package; remember to load it first

data(df_pop_state) #population of each state
state_choropleth(df_pop_state) #draw each state's population on the map

US population by state-2

Uses the choroplethr package; remember to load it first

data(continental_us_states)
state_choropleth(df_pop_state,reference_map = TRUE,
                 zoom= continental_us_states) #draw each state's population on the map

World population distribution

Map drawn with the choroplethr package; data from the WDI package

if (!require('WDI')){
    install.packages("WDI")
    library(WDI)
}
choroplethr_wdi(code="SP.POP.TOTL", year=2014, 
                title="2014 Population", num_colors=1)

World life expectancy distribution

choroplethr_wdi(code="SP.DYN.LE00.IN", year=2014, 
                title="2014 Life Expectancy")

Asia-Pacific population distribution

  • Map drawn with the choroplethr package; data from the WDI package
  • WDI: World Development Indicators
  • zoom: draw only these countries; the names must match the country.regions data exactly
choroplethr_wdi(code="SP.POP.TOTL", year=2014, 
                title="2014 Population",
                zoom=c('taiwan','japan','south korea','philippines'))

Taiwan?

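  • No convenient package is available yet
  • So we have to build it from scratch
  • Download map data for Taiwan from DIVA-GIS
  • Unzip the downloaded TWN_adm and put it in the project folder
  • The data are in shapefile format (Wiki)
  • An open format for spatial data
  • References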

Reading a shapefile into R

Use the readShapeSpatial function from the maptools package

if (!require('rgdal')){
    install.packages("rgdal");library(rgdal)
}
if (!require('gpclib')){
    install.packages("gpclib");library(gpclib)
}
if (!require('rgeos')){
    install.packages("rgeos");library(rgeos)
}
if (!require('maptools')){
    install.packages("maptools");library(maptools)
}
tw_shp <- readShapeSpatial("TWN_adm/TWN_adm2.shp")
names(tw_shp) #names of the fields in tw_shp
##  [1] "ID_0"       "ISO"        "NAME_0"     "ID_1"       "NAME_1"    
##  [6] "ID_2"       "NAME_2"     "VARNAME_2"  "NL_NAME_2"  "HASC_2"    
## [11] "CC_2"       "TYPE_2"     "ENGTYPE_2"  "VALIDFR_2"  "VALIDTO_2" 
## [16] "REMARKS_2"  "Shape_Leng" "Shape_Area"

Processing the shapefile-1

  • Requires rgdal, rgeos, gpclib
  • fortify: converts the shapefile object into a data.frame
  • References
print(tw_shp$NAME_2)
##  [1] Kaohsiung City Taipei City    Changhwa       Chiayi        
##  [5] Hsinchu        Hualien        Ilan           Kaohsiung     
##  [9] Keelung City   Miaoli         Nantou         Penghu        
## [13] Pingtung       Taichung       Taichung City  Tainan        
## [17] Tainan City    Taipei         Taitung        Taoyuan       
## [21] Yunlin        
## 21 Levels: Changhwa Chiayi Hsinchu Hualien Ilan ... Yunlin
tw_shp.df <- fortify(tw_shp, region = "ID_2")

Processing the shapefile-2

head(tw_shp.df)
##       long      lat order  hole piece    id   group
## 1 120.2390 22.75155     1 FALSE     1 33637 33637.1
## 2 120.2701 22.74135     2 FALSE     1 33637 33637.1
## 3 120.2996 22.70920     3 FALSE     1 33637 33637.1
## 4 120.3148 22.64980     4 FALSE     1 33637 33637.1
## 5 120.3168 22.61033     5 FALSE     1 33637 33637.1
## 6 120.3009 22.60195     6 FALSE     1 33637 33637.1

Making fake data to plot

#make some fake data to plot
mydata<-data.frame(NAME_2=tw_shp$NAME_2, id=tw_shp$ID_2,
                   prevalence=1:length(tw_shp$NAME_2))
head(mydata)
##           NAME_2    id prevalence
## 1 Kaohsiung City 33637          1
## 2    Taipei City 33638          2
## 3       Changhwa 33639          3
## 4         Chiayi 33640          4
## 5        Hsinchu 33641          5
## 6        Hualien 33642          6
#left join: keep every polygon vertex, even if a region has no value
final.plot<-merge(tw_shp.df,mydata,by="id",all.x=TRUE)
head(final.plot)
##      id     long      lat order  hole piece   group         NAME_2
## 1 33637 120.2390 22.75155     1 FALSE     1 33637.1 Kaohsiung City
## 2 33637 120.2701 22.74135     2 FALSE     1 33637.1 Kaohsiung City
## 3 33637 120.2996 22.70920     3 FALSE     1 33637.1 Kaohsiung City
## 4 33637 120.3148 22.64980     4 FALSE     1 33637.1 Kaohsiung City
## 5 33637 120.3168 22.61033     5 FALSE     1 33637.1 Kaohsiung City
## 6 33637 120.3009 22.60195     6 FALSE     1 33637.1 Kaohsiung City
##   prevalence
## 1          1
## 2          1
## 3          1
## 4          1
## 5          1
## 6          1
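Why all.x matters in the merge above can be seen on a toy example. The sketch below uses made-up vertices for three regions (A, B, C) and values for only two of them; all names and numbers are hypothetical.

```r
# Toy polygon vertices: three regions with three vertices each
vertices <- data.frame(
  id    = rep(c("A", "B", "C"), each = 3),
  order = 1:9,
  long  = c(0, 1, 0, 1, 2, 1, 2, 3, 2),
  lat   = c(0, 0, 1, 0, 0, 1, 0, 0, 1)
)

# Values are available for regions A and B only
region_values <- data.frame(id = c("A", "B"), prevalence = c(10, 20))

# all.x = TRUE keeps region C, with NA as its value
left_joined  <- merge(vertices, region_values, by = "id", all.x = TRUE)
# The default (inner join) silently drops region C from the map
inner_joined <- merge(vertices, region_values, by = "id")

# merge() does not promise to keep the original row order, so
# restore the vertex drawing order before plotting polygons
left_joined <- left_joined[order(left_joined$order), ]
print(left_joined)
```

Re-sorting by the order column is a defensive step: polygons drawn from shuffled vertices come out scrambled.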

Drawing the Taiwan map-1

library(RColorBrewer) #for the color palette: brewer.pal(9, "Reds")
twmap<-ggplot() +
    geom_polygon(data = final.plot, 
                 aes(x = long, y = lat, group = group, 
                     fill = prevalence), 
                 color = "black", size = 0.25) + 
    coord_map()+
    scale_fill_gradientn( colours = brewer.pal(9,"Reds"))+
    theme_void()+
    labs(title="Prevalence of X in Taiwan")

Drawing the Taiwan map-2

twmap

ggmap: loading Google Maps

if (!require('ggmap')){
    install.packages("ggmap")
    library(ggmap)
}
twmap <- get_map(location = 'Taiwan', zoom = 7,language = "zh-TW")
#location: a place name or longitude/latitude coordinates
#zoom: zoom level
#language: map language

ggmap: loading Google Maps

ggmap(twmap) #a ggplot2-based object, so it can be handled the same way

ggmap in practice-1

Taipei water quality map: data processing

http://data.taipei/opendata/datalist/apiAccess?scope=resourceAquire&rid=190796c8-7c56-42e0-8068-39242b8ec927

library(jsonlite)
WaterData<-fromJSON("http://data.taipei/opendata/datalist/apiAccess?scope=resourceAquire&rid=190796c8-7c56-42e0-8068-39242b8ec927")
WaterDataFrame<-WaterData$result$results
WaterDataFrame$longitude<-as.numeric(WaterDataFrame$longitude)
WaterDataFrame$latitude<-as.numeric(WaterDataFrame$latitude)
WaterDataFrame$qua_cntu<-as.numeric(WaterDataFrame$qua_cntu)
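The as.numeric() calls above are needed because the API delivers every field as a string. The sketch below reproduces that situation offline with a small inline JSON string; the records and their values are made up, and only the field names follow the code above.

```r
library(jsonlite)

# Made-up sample mimicking the shape of the API response
sample_json <- '{"result": {"results": [
  {"longitude": "121.52", "latitude": "25.04", "qua_cntu": "0.35"},
  {"longitude": "121.55", "latitude": "25.08", "qua_cntu": "0.12"},
  {"longitude": "121.50", "latitude": "25.10", "qua_cntu": ""}
]}}'

parsed   <- fromJSON(sample_json)
stations <- parsed$result$results       # a data frame of character columns
str(stations)

# Convert the columns that should be numbers; unparseable text becomes NA
numeric_cols <- c("longitude", "latitude", "qua_cntu")
stations[numeric_cols] <- lapply(stations[numeric_cols], as.numeric)

# Keep only stations with a usable reading
valid_stations <- stations[!is.na(stations$qua_cntu), ]
print(valid_stations)
```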

ggmap in practice-2

Taipei water quality map: plotting

library(ggmap)
TaipeiMap = get_map(location = c(121.43,24.93,121.62,25.19), 
                    zoom = 11, maptype = 'roadmap')
TaipeiMapO = ggmap(TaipeiMap)+ 
    geom_point(data=subset(WaterDataFrame,qua_cntu>=0), 
               aes(x=longitude, y=latitude,color=qua_cntu),
               size=3.5)+ #fixed point size, set outside aes() so no size legend is created
    scale_color_continuous(low = "yellow",high = "red")

ggmap in practice-3

Taipei water quality map: plotting

TaipeiMapO